Link to data source: https://www.kaggle.com/aungpyaeap/supermarket-sales
Context
The growth of supermarkets in the most populated cities is increasing, and market competition is also high. This dataset contains the historical sales of a supermarket company, recorded in 3 different branches over 3 months.
Data Dictionary
Invoice id: Computer-generated sales slip invoice identification number
Branch: Branch of supercenter (3 branches are available identified by A, B and C).
City: Location of supercenters
Customer type: Type of customer, recorded as Member for customers using a member card and Normal for those without one.
Gender: Gender type of customer
Product line: General item categorization groups - Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel
Unit price: Price of each product in USD
Quantity: Number of products purchased by customer
Tax: 5% tax fee charged on the customer's purchase (see the sanity-check sketch after this list)
Total: Total price including tax
Date: Date of purchase (Record available from January 2019 to March 2019)
Time: Purchase time (10am to 9pm)
Payment: Payment used by customer for purchase (3 methods are available – Cash, Credit card and Ewallet)
COGS: Cost of goods sold
Gross margin percentage: Gross margin percentage
Gross income: Gross income
Rating: Customer satisfaction rating of their overall shopping experience (on a scale of 1 to 10)
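As a quick sanity check of the derived columns described above, the relationships "Tax = 5% of Unit price times Quantity" and "Total = Unit price times Quantity plus Tax" can be verified directly. This is a standalone sketch, and the exact column names ('Tax 5%', 'Unit price', 'Quantity', 'Total') are assumed to match the Kaggle CSV header.
import pandas as pd
#standalone sanity check of the derived columns (column names assumed from the Kaggle CSV header)
check = pd.read_csv('supermarket_sales.csv')
print((check['Tax 5%'] - 0.05*check['Unit price']*check['Quantity']).abs().max())   #should be close to 0
print((check['Total'] - (check['Unit price']*check['Quantity'] + check['Tax 5%'])).abs().max())   #should be close to 0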
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import calmap
from pandas_profiling import ProfileReport
df=pd.read_csv('supermarket_sales.csv')
df.head() #visualize the first 5 rows
df.tail() #visualize the last 5 rows
df.columns #inspect the names of the columns
df.dtypes
Analysis:
df['Date']=pd.to_datetime(df['Date'])
df.dtypes
df.set_index('Date', inplace=True)
df.head()
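With Date parsed and set as the index, date-based slicing becomes straightforward. The sketch below is not part of the original notebook; it pulls out a single month, assuming the January to March 2019 range described in the data dictionary.
#example of date-based slicing on the DatetimeIndex (sorting first keeps slicing predictable)
df.sort_index().loc['2019-01'].head()   #all January 2019 transactions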
df.describe()
df.duplicated().sum()
Analysis:
# let's check the duplicated rows
df[df.duplicated()]
#Remove duplicates
df.drop_duplicates(inplace=True)
df.duplicated().sum()
print(df.isna().sum())
print('__________________')
print(df.isna().sum()/len(df)*100)
#Visualizing missing values
sns.heatmap(df.isnull(), cbar=False)
Analysis:
#Filling the missing values with the mean will only work for numeric variables
df.fillna(df.mean(numeric_only=True), inplace=True)
sns.heatmap(df.isnull(), cbar=False)
#if our goal is to replace missing values in a qualitative or non-numeric variable, then we must replace them with the mode
df.fillna(df.mode().iloc[0], inplace=True)
sns.heatmap(df.isnull(), cbar=False)
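Putting the two ideas together, numeric columns can be filled with their mean and non-numeric columns with their mode in a single combined pass; this is a minimal sketch of that strategy rather than part of the original notebook.
#fill numeric columns with the column mean and non-numeric columns with the column mode
num_cols = df.select_dtypes(include=np.number).columns
cat_cols = df.select_dtypes(exclude=np.number).columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])
df.isna().sum()   #should now be 0 everywhere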
What does the distribution of customer ratings look like? Is it skewed?
#plotting the distribution of the ratings
sns.histplot(df['Rating'], kde=True)  #distplot is deprecated in recent seaborn versions
plt.axvline(x=np.mean(df['Rating']), c='red', ls='--', label='mean')
plt.axvline(x=np.percentile(df['Rating'], 25), c='green', ls='--', label='25-75 percentile')
plt.axvline(x=np.percentile(df['Rating'], 75), c='green', ls='--')
plt.legend()
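To complement the plot, the skewness can also be checked numerically with pandas' built-in skew (a small addition to the original analysis).
#skewness close to 0 suggests a roughly symmetric distribution of ratings
print(df['Rating'].skew())
print(df['Rating'].mean(), df['Rating'].median())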
Analysis:
df.hist(figsize=(10,10))
Do aggregate sales numbers differ by much between branches?
sns.countplot(x='Branch', data=df)
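The exact counts behind the plot can be printed as well (a small addition, not in the original notebook).
#transactions per branch, as raw counts and as shares
print(df['Branch'].value_counts())
print(df['Branch'].value_counts(normalize=True).round(2))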
Analysis:
Do aggregate sales numbers differ by much between type of payments?
sns.countplot(x='Payment', data=df)
Analysis:
Is there a relationship between gross income and customer ratings?
sns.scatterplot(x='Rating', y='gross income', data=df)
Analysis:
Is there a difference in rating across the different branches, and does the amount customers spend differ between branches?
sns.boxplot(x=df['Branch'], y=df['gross income'])
Analysis:
sns.boxplot(x=df['Gender'], y=df['gross income'])
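To put numbers on what the boxplots show, a quick grouped summary can be added (a sketch using the same columns as above).
#mean and median gross income per branch and per gender
print(df.groupby('Branch')['gross income'].agg(['mean', 'median']))
print(df.groupby('Gender')['gross income'].agg(['mean', 'median']))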
Analysis:
Is there a noticeable time trend in gross income?
To check the trend, as we have multiple sales per day, the first step will be to aggregate the data by day.
daily_mean = df.groupby(df.index).mean(numeric_only=True)
daily_mean.head()
sns.lineplot(x=daily_mean.index, y=daily_mean['gross income'])
There is no clear trend; sales vary around a roughly constant mean. However, since we are only working with 3 months of data, it would be interesting to have more data to confirm the presence or absence of a trend.
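One way to make any underlying trend easier to spot despite the day-to-day noise is to smooth the daily means with a rolling window; the sketch below is an optional extra step, not part of the original analysis.
#7-day rolling mean of the daily average gross income
daily = df.groupby(df.index)['gross income'].mean()
sns.lineplot(x=daily.index, y=daily.rolling(7).mean())
plt.ylabel('7-day rolling mean of gross income')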
sns.pairplot(df)
round(np.corrcoef(df['gross income'], df['Rating'])[1][0],2)
sns.heatmap(np.round(df.corr(numeric_only=True), 2), annot=True)  #numeric_only avoids errors on the text columns in recent pandas
dataset=pd.read_csv('supermarket_sales.csv')
#Correlation matrix
np.round(df.corr(numeric_only=True), 2)
prof = ProfileReport(dataset, title='Pandas Profiling Report', explorative=True)
prof
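If a shareable version of the report is needed, ProfileReport can also write it to a standalone HTML file via its to_file method.
#export the profiling report to HTML (the output filename is arbitrary)
prof.to_file('supermarket_sales_report.html')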